In order to prep our data from our EDA we will load, transform the date, remove all zero row from the variable we are testing hospitalizedCurrently, and sort from oldest to newest. This will allow us to easily work through our multivariate time series evaluations.
us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily.7.26.20.csv", header = T, strip.white = T)
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
us[is.na(us)] <- 0
us = us[order(as.Date(us$date, format = "%Y%m%d")),]
head(us)
## date positive negative pending hospitalizedCurrently
## 187 2020-01-22 2 0 0 0
## 186 2020-01-23 2 0 0 0
## 185 2020-01-24 2 0 0 0
## 184 2020-01-25 2 0 0 0
## 183 2020-01-26 2 0 0 0
## 182 2020-01-27 2 0 0 0
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 187 0 0 0
## 186 0 0 0
## 185 0 0 0
## 184 0 0 0
## 183 0 0 0
## 182 0 0 0
## onVentilatorCurrently onVentilatorCumulative recovered death
## 187 0 0 0 0
## 186 0 0 0 0
## 185 0 0 0 0
## 184 0 0 0 0
## 183 0 0 0 0
## 182 0 0 0 0
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 187 2 0 0 0
## 186 2 0 0 0
## 185 2 0 0 0
## 184 2 0 0 0
## 183 2 0 0 0
## 182 2 0 0 0
## positiveIncrease
## 187 0
## 186 0
## 185 0
## 184 0
## 183 0
## 182 0
It is difficult to assume stationarity for this data due to multiple factors. We are working under the assumption that COVID is a novel virus and cases as well as hospitalizations will eventually return to zero. This being said our current modeling techniques do things such as return to the median or mimic the previously seen trends. Also, we see a heavy wandering trend in both new cases and hospitalization would be dependent on this as well as time. We will review the data and see what, if any, non-stationary components reveal themselves and model the data accordingly.
Traits:
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
geom_line(color="orange")+
labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
When reviewing the variables and the scatter plot from our previous EDA we can see that there are some correlations that were expected, for example currentHospitalizations is correlated with the variables that reflect ICU and Ventilator patients. These metrics wouldn’t exist without hospitalizations. These variables also cannot help up predict hospitalizations because they after occurrences. So, we will actually be leveraging variables such as positive increase in order to see if there is some sort of correlation between hospitalized patients and the number of positive cases.
ggplot(data = us, aes(x=date, y=positiveIncrease))+
geom_line(color="orange")+
labs(title = "Increase in COVID Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
Since we are working with a seasonal and transformation component, but it requires an additional transformation for the total positive COVID cases to become stationary for evaluation, we will only use positive increase of cases to predict currently hospitalized cases for the VAR model.
#Positive Cases Transformations
#pos.diff = artrans.wge(us$positive, phi.tr = 1, lag.max = 100)
#pos.diff.2 = artrans.wge(pos.diff, phi.tr = 1, lag.max = 100)
#pos.trans = artrans.wge(pos.diff.2,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#Increase Positive Transformation
inpos.diff = artrans.wge(us$positiveIncrease, phi.tr = 1, lag.max = 100)
inpos.trans = artrans.wge(inpos.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#Current Hospitalization Transformations
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)
currhosp.trans = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#VARselect to choose lag
VARselect(cbind(currhosp.trans, inpos.trans), lag.max = 10, type= "both")
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 7 7 2 7
##
## $criteria
## 1 2 3 4 5
## AIC(n) 3.093857e+01 3.083329e+01 3.079091e+01 3.076976e+01 3.069554e+01
## HQ(n) 3.101650e+01 3.095018e+01 3.094676e+01 3.096458e+01 3.092932e+01
## SC(n) 3.113058e+01 3.112131e+01 3.117493e+01 3.124979e+01 3.127158e+01
## FPE(n) 2.731960e+13 2.459304e+13 2.357880e+13 2.309552e+13 2.145770e+13
## 6 7 8 9 10
## AIC(n) 3.066656e+01 3.051427e+01 3.053424e+01 3.059674e+01 3.059649e+01
## HQ(n) 3.093930e+01 3.082598e+01 3.088492e+01 3.098638e+01 3.102509e+01
## SC(n) 3.133860e+01 3.128233e+01 3.139830e+01 3.155681e+01 3.165257e+01
## FPE(n) 2.086403e+13 1.793906e+13 1.833019e+13 1.955149e+13 1.959499e+13
#fit model
usdiff.var.fit = VAR(cbind(currhosp.trans, inpos.trans), type = "both", p = 2)
#AIC: 30.51427
#7 day predictions
pred.var7 = predict(usdiff.var.fit, n.ahead = 7)
pred.var7$fcst$currhosp.trans[,1:3]
## fcst lower upper
## [1,] -42.12350 -2942.410 2858.163
## [2,] -385.17371 -3346.493 2576.145
## [3,] -80.65766 -3262.529 3101.214
## [4,] -124.14161 -3327.324 3079.041
## [5,] -46.62794 -3284.538 3191.282
## [6,] -43.58380 -3288.207 3201.039
## [7,] -11.23201 -3262.277 3239.813
#90 day predictions
pred.var90 = predict(usdiff.var.fit, n.ahead = 90)
invisible(pred.var90$fcst$currhosp.trans)
startingPoints7 = us$hospitalizedCurrently[126:132]
currHospForecasts7 = pred.var7$fcst$currhosp.trans[,1:3] + startingPoints7
startingPoints90 = us$hospitalizedCurrently[43:132]
currHospForecasts90 = pred.var90$fcst$currhosp.trans[,1:3] + startingPoints90
#7 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")
- 90 Day
#90 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,222), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "90 Day Currently Hospitalized Patients Forecast")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$fcst, type = "l", col = "red")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$upper, type = "l", col = "blue")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$lower, type = "l", col = "blue")
varASE7 = mean((us$hospitalizedCurrently[126:132]-currHospForecasts7[1:7])^2)
varASE7
## [1] 25178.55
varASE7 = mean((us$hospitalizedCurrently[43:132]-currHospForecasts90[1:90])^2)
varASE7
## [1] 10791.59
Since we have been looking at additional regressors for all our above models and know that the best regressor to leverage will be positive increase of COVID cases. This regressor reflects similar behavior to that of current hospitalizations and can be model properly against hospitalizations. 1. Create Current Hospitalized ts variable
tUScurrhop = ts(us$hospitalizedCurrently)
tUSx = data.frame(positiveIncrease = ts(us$positiveIncrease))
fit.mlp.new = mlp(ts(tUSx), reps = 50, comb = "mean")
mlp.new.fore7 = forecast(fit.mlp.new, h = 7)
invisible(mlp.new.fore7)
mlp.new.fore90 = forecast(fit.mlp.new, h = 90)
invisible(mlp.new.fore90)
new.regressor7 <- data.frame(c(us$positiveIncrease, mlp.new.fore7$mean))
invisible(new.regressor7)
new.regressor90 <- data.frame(c(us$positiveIncrease, mlp.new.fore90$mean))
invisible(new.regressor90)
mlp.fit1 = mlp(tUScurrhop, xreg = tUSx, outplot = T, comb = "mean")
plot(mlp.fit1)
currhosp.fore7 = forecast(mlp.fit1, h = 7, xreg = new.regressor7)
plot(currhosp.fore7)
- Currently Hospitalized 90 Day Forecast w/ Positive Increase Regressor
currhosp.fore90 = forecast(mlp.fit1, h = 90, xreg = new.regressor90)
plot(currhosp.fore90)
#ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7$mean)^2)
#ASEmlr7
tUScurrhop2 = ts(us$hospitalizedCurrently[1:125])
tUSx2 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:125]))
fit.mlp.new2 = mlp(ts(tUSx2), reps = 50, comb = "mean")
mlp.new.fore7.2 = forecast(fit.mlp.new2, h = 7)
invisible(mlp.new.fore7.2)
new.regressor7.2 <- data.frame(c(us$positiveIncrease[1:125], mlp.new.fore7.2$mean))
invisible(new.regressor7.2)
mlp.fit1.2 = mlp(tUScurrhop2, xreg = tUSx2, comb = "mean")
#plot(mlp.fit1.2)
currhosp.fore7.2 = forecast(mlp.fit1.2, h = 7, xreg = new.regressor7.2)
#plot(currhosp.fore7.2)
ASEmlr7.2 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7.2$mean)^2)
ASEmlr7.2
## [1] 171151.4
#ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore90$mean)^2)
#ASEmlr90
tUScurrhop3 = ts(us$hospitalizedCurrently[1:42])
tUSx3 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:42]))
fit.mlp.new3 = mlp(ts(tUSx3), reps = 50, comb = "mean")
mlp.new.fore7.3 = forecast(fit.mlp.new3, h = 90)
invisible(mlp.new.fore7.3)
new.regressor7.3 <- data.frame(c(us$positiveIncrease[1:42], mlp.new.fore7.3$mean))
invisible(new.regressor7.3)
mlp.fit1.3 = mlp(tUScurrhop3, xreg = tUSx2, comb = "mean")
#plot(mlp.fit1.3)
currhosp.fore7.3 = forecast(mlp.fit1.3, h = 90, xreg = new.regressor7.3)
#plot(currhosp.fore7.3)
ASEmlr7.3 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore7.3$mean)^2)
ASEmlr7.3
## [1] 40431673008
We completed a default neural network model. With so many opportunities for how to actually tune neural network model we knew this would not be our best model in this case. So we moved forward with a hyper tuned neural network model that allows us to calculate many windowed ASEs and compare those model against each other.
We have leveraged the tswgewrapper code above to produce a hyperparameter tuned NN model for our ensemble model. This model will work as our higher functioning ensemble
#if (!require(devtools)) {
# install.packages("devtools")
#}
#devtools::install_github("josephsdavid/tswgewrapped", build_vignettes = TRUE)
library(tswgewrapped)
data_train.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:122], positiveIncrease = us$positiveIncrease[1:122])
data_test.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[123:132], positiveIncrease = us$positiveIncrease[123:132])
# search for best NN hyperparameters in given grid
model.m = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.m, var_interest = "hospitalizedCurrently",
search = 'random', tuneLength = 5, parallel = TRUE,
batch_size = 50, h = 7, m = 7,
verbose = 1)
## Registered S3 method overwritten by 'lava':
## method from
## print.pcor greybox
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 10, hd = 2, allow.det.season = TRUE on full training set
## - Total Time for training: : 50.664 sec elapsed
res.m <- model.m$summarize_hyperparam_results()
res.m
## reps hd allow.det.season RMSE ASE RMSESD ASESD
## 1 10 2 TRUE 3103.429 13970241 2184.689 16296534
## 2 11 3 FALSE 3129.341 14942705 2380.109 19673017
## 3 16 1 FALSE 3559.638 16480775 2047.127 15988977
## 4 16 3 TRUE 3204.752 15391189 2373.358 19509361
## 5 22 3 FALSE 3268.191 16196849 2463.200 19979406
model.m$plot_hyperparam_results()
best.m <- model.m$summarize_best_hyperparams()
best.m
## reps hd allow.det.season
## 1 10 2 TRUE
final.ase.m <- dplyr::filter(res.m, reps == best.m$reps &
hd == best.m$hd &
allow.det.season == best.m$allow.det.season)[['ASE']]
final.ase.m
## [1] 13970241
# Final Model
caret_model.m = model.m$get_final_models(subset = 'a')
caret_model.m$finalModel
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1065689.5459.
# Plot Final Model
plot(caret_model.m$finalModel)
#Ensemble model
ensemble.mlp = mlp(tUScurrhop, xreg = tUSx, outplot = T, reps = 10, hd = 2, allow.det.season = T)
ensemble.mlp
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1025155.9706.
#Plot ensemble model
plot(ensemble.mlp)
fore7.m = forecast(ensemble.mlp, xreg = new.regressor7, h=7)
plot(fore7.m)
fore90.m = forecast(ensemble.mlp, xreg = new.regressor90, h=90)
plot(fore90.m)
ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-fore7.m$mean)^2)
ASEmlr7
## [1] 2113141
ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-fore90.m$mean)^2)
ASEmlr90
## [1] 2239616945
We started our multivariate analysis using multiple regression with correlated errors models. We ended up producing two models, one with and without a trend. We predicted that a trend (or time) would be a deciding variable in which of these models would be outperform the other. However, we expected trend to be the better model. When we compared the two via their AICs, we found the MLR model without trend performed better both with an AIC and when we compared the ASEs between the two model types. When applying our domain knowledge, we did come to the conclusion that time would not necessarily be a strong determinant for the severity of COVID since we have yet to observe, and subsequently able to analyze, a full cycle of the virus effect on the US population. Once we are able to analyze data to this level, we would expect a time correlation and therefore lag/trend model performing better in the predictions of the virus’s severity and therefore hospitalized patients.
Next we completed a Vector AR (VAR) model analysis of our data. This modeling processes requires the data of both the independent and dependent variables to be stationary. Which means we are actually modeling on the residuals of the data. This also results in the predictions/forecasts being based on differences of the forecasts and therefore the forecasts have to be calculated based on the previous periods values. This modeling techniques is highly sensitive to the previously observed values, which in the case of COVID is essential for understanding the severity of COVID. Since we have yet to observe a full cycle the modeling for predicting hospitalized cases should be closely based on what we’ve previously observed. So far this is the better of the three models we’ve produced.
For our final 2 models we performed multilayered perceptron or neural network model. For our first model we used a mostly default hyper parameters for the model. The ASEs returned with cross validation are much higher than the VAR model. We see it in the billions for the 90-day window forecast. This is as expected since the NN model creates multiple scenarios and takes the average of those in order to forecast moving forward. This means even the highest and lowest are incorporated into the forecast. While some of the scenarios are not statistically likely the NN model is attempting to create a forecast that takes these possibilities into account. In order to help created a better neural network we created a hyper parameter turned model. This model produced ASEs much lower than then default parameter neural network model. The forecasts for the hyper tuned NN model are slightly less dispersed than the default NN model telling us that these predictions are more helpful. Hyper tuning the additional parameters allows the model to be closer in predicting future hospitalized cases. We performed windowed ASEs in order to pick the best hyper tuned parameters for our advanced neural network model.
Upon reviewing our 90-day forecasts and ASEs we have concluded it would not be a statistically sound decision to run a 90-day forecast and allow it to make long term decisions until we have more data. Upon performing 90-day forecasts we end up having to train the model on 20% of the data and test it on 80% of the data. This is the exact opposite of modeling a statistically sound time series for prediction. For our COVID predictions of severity we will advise to only to produce forecasts and ASE with 7-day forecasts but have produced the ASE for reference below. This applies to all of our model’s cross validation efforts and ASE calculations.
We have chosen two models to help us predict severity with the COVID hospitalized. We will leverage the vector AR model for short term forecasts This model allows us to base our predictions off of previously observed values. Since COVID has not completed a full cycle and has shown heavy wandering behavior in the realization, we feel using this method will be the closest in order to get us prepared for forecasts for our 7-day and doing short term planning for the hospitalizes supplies and staffing with prediction intervals to help us make sure we account for possible peaks and valleys in our forecast.
For our long term forecasting we have chosen to go with our hyper tuned parameter neural network model. This model is helpful for long term forecasting because it has the ability to adjust and reapply the most effective model based on the newest data input with daily updates. Since the multilayered perceptron neural network models are not based on a fixed distribution assumption, this model will not come with prediction intervals. But rather, will continue to calculate all of the probably outcomes and produce a mean forecast for us to base all planning. The hyper parameter turned neural network model takes into account multiple possibilities and therefore does give us the most statistically useful forecast for our 90-day period.
It is essential both of these models be updated daily. COVID cases and hospitalizations changes frequently and the only way to continue to forecast and prepare as effectively as possible is to have updated models with as much historical data being taken into account as possible. As we move forward with presenting these models we’ve deicded on, it is important to remember these just appear to be the most useful models, and while we can work from them, none of them will be correct.
#VAR 7-Day Forecasts w/ Prediction Intervals
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")
#Hyper Tuned Parameter Neural Network Model Forecast w/ Prediction Intervals
fore90.m = forecast(ensemble.mlp, xreg = new.regressor90, h=90, )
plot(fore90.m)